Learning to Extract Protein–Protein Interactions using Distant Supervision

Authors

  • Philippe Thomas
  • Illés Solt
  • Roman Klinger
  • Ulf Leser
Abstract

Most relation extraction methods, especially in the domain of biology, rely on machine learning to classify whether a co-occurring pair of entities in a sentence is related. Such an approach requires a training corpus, whose expert annotation is tedious, time-consuming, and expensive. We overcome this problem by using existing knowledge in structured databases to automatically generate a training corpus for protein-protein interactions. An extensive evaluation of different instance selection strategies is performed to maximize robustness on this presumably noisy resource. Strategies that consistently improve performance include a majority voting ensemble of classifiers trained on subsets of the training corpus and the use of knowledge bases consisting of proven non-interactions. Our best configured model, built without manually annotated data, shows very competitive results on several publicly available benchmark corpora.
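
The two ideas named in the abstract, labeling co-occurring protein pairs from knowledge bases and combining classifiers by majority vote, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the data layout, the choice to skip pairs found in neither knowledge base, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def label_pairs(sentence_proteins, known_interactions,
                known_non_interactions=frozenset()):
    """Label each co-occurring protein pair in one sentence (hypothetical helper).

    A pair is a positive instance if it appears in the interaction knowledge
    base and a negative instance if it appears in the non-interaction
    knowledge base. Pairs found in neither are skipped (an assumption of
    this sketch, not a claim about the paper).
    """
    labeled = []
    for a, b in combinations(sorted(set(sentence_proteins)), 2):
        pair = frozenset((a, b))
        if pair in known_interactions:
            labeled.append((a, b, True))
        elif pair in known_non_interactions:
            labeled.append((a, b, False))
    return labeled


def majority_vote(predictions):
    """Combine boolean votes from several classifiers; ties resolve to False."""
    counts = Counter(predictions)
    return counts[True] > counts[False]
```

In use, `label_pairs` would be applied to every sentence of an unannotated corpus to build the noisy training set, several classifiers would be trained on different subsets of it, and `majority_vote` would merge their predictions for each candidate pair.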

Similar resources

Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Predicting protein localization is among the most important problems in bioinformatics and is used to determine where proteins reside in cells and in organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use the features extracted from pro...

Learning to Extract Relations from the Web and Biomedical Corpora

Automatically identifying semantic relationships between entities mentioned in text documents is an important task in natural language processing. The set of relevant relationships can be very diverse, ranging from company acquisitions mentioned in web documents to interactions between human proteins as mentioned in biomedical articles. In this talk I will describe two approaches to learning re...

Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches

A DNA sequence, although it contains all genetic traits, is not a functional entity. Instead, it is converted into protein sequences by transcription and translation. The protein sequence then takes on a 3D structure, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can think of is undertaken by proteins with specific functio...

Distant Supervision for Relation Extraction beyond the Sentence Boundary

The growing demand for structured knowledge has led to great interest in relation extraction, especially in cases with limited supervision. However, existing distant supervision approaches only extract relations expressed in single sentences. In general, cross-sentence relation extraction is under-explored, even in the supervised-learning setting. In this paper, we propose the first approach f...

Committee-based Selection of Weakly Labeled Instances for Learning Relation Extraction

Manual annotation is a tedious and time-consuming process, usually needed for generating training corpora to be used in a machine learning scenario. The distant supervision paradigm aims at automatically generating such corpora from structured data. The active learning paradigm aims at reducing the effort needed for manual annotation. We explore active and distant learning approaches jointly to...

Publication date: 2011